{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.  \n",
        "Licensed under the MIT License.\n",
        "\n",
        "# Deploy high performance question-answer model on AzureML with ONNX Runtime\n",
        "\n",
        "This tutorial takes a pre-trained BERT model, converts it to ONNX, and deploys the ONNX model with ONNX Runtime through AzureML.\n",
        "\n",
        "In the following sections, we use the HuggingFace Bert model trained with Stanford Question Answering Dataset (SQuAD) dataset as an example. You can also train or fine-tune your own question answer model.\n",
        "\n",
        "The question answer scenario takes a question and a piece of text called a `context`, and produces answer, which is a string of text taken from the context. This scenario tokenizes and encodes the question and the context, feeds the inputs into the transformer model and generates the answer by producing the most likely start and end tokens in the context, which are then mapped back into words.\n",
        "\n",
        "\n",
        "## Contents\n",
        "\n",
        "**Prerequisites** to set up your Azure ML work environments\n",
        "\n",
        "**Obtain model and convert to ONNX**\n",
        "\n",
        "**Deploy model using ONNX Runtime and AzureML**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Prerequisites\n",
        "\n",
        "To run on AzureML, you need:\n",
        "* Azure subscription\n",
        "* Azure Machine Learning Workspace (see this notebook for creation of the workspace if you do not already have one: [AzureML configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/56e0ebc5acb9614fac51d8b98ede5acee8003820/configuration.ipynb))\n",
        "* the Azure Machine Learning SDK\n",
        "* the Azure CLI and the Azure Machine learning CLI extension (> version 2.2.2)\n",
        "\n",
        "You might also find the following resources useful:\n",
        "* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning\n",
        "* The [Azure Portal](https://portal.azure.com) allows you to track the status of your deployments."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1649374486622
        },
        "jupyter": {
          "outputs_hidden": false,
          "source_hidden": false
        },
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "outputs": [],
      "source": [
        "# To install dependencies directly run the following\n",
        "%pip install torch\n",
        "%pip install transformers\n",
        "%pip install azureml azureml.core\n",
        "%pip install onnxruntime\n",
        "%pip install matplotlib\n",
        "\n",
        "# To create a a Jupter kernel from your conda environment, run the following. replacing <kernel name> with your own name\n",
        "#   conda install -c anaconda ipykernel\n",
        "#   python -m ipykernel install --user --name=<kernel name>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Install the AzureML CLI extension, which is used in the deployment steps below"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "!az login\n",
        "!az extension add --name ml\n",
        "# Remove the azure-cli-ml extension if it is installed, as it is not compatible with the az ml extension\n",
        "!az extension remove azure-cli-ml"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Obtain and convert PyTorch model to ONNX format\n",
        "\n",
        "In the code below, we obtain a BERT model fine-tuned for question answering with the SQUAD dataset from HuggingFace.\n",
        "\n",
        "If you'd like to pre-train a BERT model from scratch, follow the instructions in\n",
        "[Pretraining of the BERT model](https://github.com/microsoft/AzureML-BERT/blob/master/pretrain/PyTorch/notebooks/BERT_Pretrain.ipynb). \n",
        "And if you would like to fine-tune the model with your own dataset, refer to  [AzureML Bert Eval Squad](https://github.com/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/BERT_Eval_SQUAD.ipynb)\n",
        "or [AzureML Bert Eval GLUE](https://github.com/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/BERT_Eval_GLUE.ipynb).\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
        "### Export the model\n",
        "\n",
        "Use the PyTorch ONNX exporter to create a model in ONNX format, to be run with ONNX Runtime."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1649374524671
        },
        "jupyter": {
          "outputs_hidden": true
        }
      },
      "outputs": [],
      "source": [
        "%%writefile export.py\n",
        "\n",
        "import torch\n",
        "from transformers import BertForQuestionAnswering\n",
        "\n",
        "model_name = \"bert-large-uncased-whole-word-masking-finetuned-squad\"\n",
        "model_path = \"./\" + model_name + \".onnx\"\n",
        "model = BertForQuestionAnswering.from_pretrained(model_name)\n",
        "\n",
        "# set the model to inference mode\n",
        "# It is important to call torch_model.eval() or torch_model.train(False) before exporting the model, \n",
        "# to turn the model to inference mode. This is required since operators like dropout or batchnorm \n",
        "# behave differently in inference and training mode.\n",
        "model.eval()\n",
        "\n",
        "# Generate dummy inputs to the model. Adjust if neccessary\n",
        "inputs = {\n",
        "        'input_ids':   torch.randint(32, [1, 32], dtype=torch.long), # list of numerical ids for the tokenized text\n",
        "        'attention_mask': torch.ones([1, 32], dtype=torch.long),     # dummy list of ones\n",
        "        'token_type_ids':  torch.ones([1, 32], dtype=torch.long)     # dummy list of ones\n",
        "    }\n",
        "\n",
        "symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}\n",
        "torch.onnx.export(model,                                         # model being run\n",
        "                  (inputs['input_ids'],\n",
        "                   inputs['attention_mask'], \n",
        "                   inputs['token_type_ids']),                    # model input (or a tuple for multiple inputs)\n",
        "                  model_path,                                    # where to save the model (can be a file or file-like object)\n",
        "                  opset_version=11,                              # the ONNX version to export the model to\n",
        "                  do_constant_folding=True,                      # whether to execute constant folding for optimization\n",
        "                  input_names=['input_ids',\n",
        "                               'input_mask', \n",
        "                               'segment_ids'],                   # the model's input names\n",
        "                  output_names=['start_logits', \"end_logits\"],   # the model's output names\n",
        "                  dynamic_axes={'input_ids': symbolic_names,\n",
        "                                'input_mask' : symbolic_names,\n",
        "                                'segment_ids' : symbolic_names,\n",
        "                                'start_logits' : symbolic_names, \n",
        "                                'end_logits': symbolic_names})   # variable length axes"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "%run export.py"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
        "## Run the ONNX model with ONNX Runtime\n",
        "\n",
        "The following code runs the ONNX model with ONNX Runtime. You can test it locally before deploying it to Azure Machine Learning.\n",
        "\n",
        "The `init()` function is called at startup, performing the one-off operations such as creating the tokenizer and the ONNX Runtime session.\n",
        "\n",
        "The `run()` function is called when we run the model using the Azure ML endpoint.\n",
        "Add necessary `preprocess()` and `postprocess()` steps.\n",
        "\n",
        "For local testing and comparison, you can also run the PyTorch model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "jupyter": {
          "outputs_hidden": false,
          "source_hidden": false
        },
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "outputs": [],
      "source": [
        "%%writefile score.py\n",
        "import os\n",
        "import logging\n",
        "import json\n",
        "import numpy as np\n",
        "import onnxruntime\n",
        "import transformers\n",
        "import torch\n",
        "\n",
        "# The pre process function take a question and a context, and generates the tensor inputs to the model:\n",
        "# - input_ids: the words in the question encoded as integers\n",
        "# - attention_mask: not used in this model\n",
        "# - token_type_ids: a list of 0s and 1s that distinguish between the words of the question and the words of the context\n",
        "# This function also returns the words contained in the question and the context, so that the answer can be decoded into a phrase. \n",
        "def preprocess(question, context):\n",
        "    encoded_input = tokenizer(question, context)\n",
        "    tokens = tokenizer.convert_ids_to_tokens(encoded_input.input_ids)\n",
        "    return (encoded_input.input_ids, encoded_input.attention_mask, encoded_input.token_type_ids, tokens)\n",
        "\n",
        "# The post process function takes the list of tokens in the question and context, as well as the output of the \n",
        "# model, the list of log probabilities for the choices of start and end of the answer, and maps it back to an\n",
        "# answer to the question that is asked of the context.\n",
        "def postprocess(tokens, start, end):\n",
        "    results = {}\n",
        "    answer_start = np.argmax(start)\n",
        "    answer_end = np.argmax(end)\n",
        "    if answer_end >= answer_start:\n",
        "        answer = tokens[answer_start]\n",
        "        for i in range(answer_start+1, answer_end+1):\n",
        "            if tokens[i][0:2] == \"##\":\n",
        "                answer += tokens[i][2:]\n",
        "            else:\n",
        "                answer += \" \" + tokens[i]\n",
        "        results['answer'] = answer.capitalize()\n",
        "    else:\n",
        "        results['error'] = \"I am unable to find the answer to this question. Can you please ask another question?\"\n",
        "    return results\n",
        "\n",
        "# Perform the one-off intialisation for the prediction. The init code is run once when the endpoint is setup.\n",
        "def init():\n",
        "    global tokenizer, session, model\n",
        "\n",
        "    model_name = \"bert-large-uncased-whole-word-masking-finetuned-squad\"\n",
        "    model = transformers.BertForQuestionAnswering.from_pretrained(model_name)\n",
        "\n",
        "    # use AZUREML_MODEL_DIR to get your deployed model(s). If multiple models are deployed, \n",
        "    # model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), '$MODEL_NAME/$VERSION/$MODEL_FILE_NAME')\n",
        "    model_dir = os.getenv('AZUREML_MODEL_DIR')\n",
        "    if model_dir == None:\n",
        "        model_dir = \"./\"\n",
        "    model_path = os.path.join(model_dir, model_name + \".onnx\")\n",
        "\n",
        "    # Create the tokenizer\n",
        "    tokenizer = transformers.BertTokenizer.from_pretrained(model_name)\n",
        "\n",
        "    # Create an ONNX Runtime session to run the ONNX model\n",
        "    session = onnxruntime.InferenceSession(model_path, providers=[\"CPUExecutionProvider\"])  \n",
        "\n",
        "\n",
        "# Run the PyTorch model, for functional and performance comparison\n",
        "def run_pytorch(raw_data):\n",
        "    inputs = json.loads(raw_data)\n",
        "\n",
        "    model.eval()\n",
        "\n",
        "    logging.info(\"Question:\", inputs[\"question\"])\n",
        "    logging.info(\"Context: \", inputs[\"context\"])\n",
        "\n",
        "    input_ids, input_mask, segment_ids, tokens = preprocess(inputs[\"question\"], inputs[\"context\"])\n",
        "\n",
        "    model_outputs = model(torch.tensor([input_ids]),  token_type_ids=torch.tensor([segment_ids]))\n",
        "\n",
        "    return postprocess(tokens, model_outputs.start_logits.detach().numpy(), model_outputs.end_logits.detach().numpy())\n",
        "\n",
        "# Run the ONNX model with ONNX Runtime\n",
        "def run(raw_data):\n",
        "    logging.info(\"Request received\")\n",
        "\n",
        "    inputs = json.loads(raw_data)\n",
        "\n",
        "    logging.info(inputs)\n",
        "\n",
        "    # Preprocess the question and context into tokenized ids\n",
        "    input_ids, input_mask, segment_ids, tokens = preprocess(inputs[\"question\"], inputs[\"context\"])\n",
        "\n",
        "    logging.info(\"Running inference\")\n",
        "    \n",
        "    # Format the inputs for ONNX Runtime\n",
        "    model_inputs = {\n",
        "        'input_ids':   [input_ids], \n",
        "        'input_mask':  [input_mask],\n",
        "        'segment_ids': [segment_ids]\n",
        "        }\n",
        "                  \n",
        "    outputs = session.run(['start_logits', 'end_logits'], model_inputs)\n",
        "\n",
        "    logging.info(\"Post-processing\")  \n",
        "\n",
        "    # Post process the output of the model into an answer (or an error if the question could not be answered)\n",
        "    results = postprocess(tokens, outputs[0], outputs[1])\n",
        "\n",
        "    logging.info(results)\n",
        "\n",
        "    return results\n",
        "\n",
        "\n",
        "if __name__ == '__main__':\n",
        "    init()\n",
        "\n",
        "    input = \"{\\\"question\\\": \\\"What is Dolly Parton's middle name?\\\", \\\"context\\\": \\\"Dolly Rebecca Parton is an American singer-songwriter\\\"}\"\n",
        "\n",
        "    run_pytorch(input)\n",
        "    print(run(input))\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "jupyter": {
          "outputs_hidden": false,
          "source_hidden": false
        },
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "outputs": [],
      "source": [
        "%run score.py"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Deploy model with ONNX Runtime through AzureML\n",
        "\n",
        "Now that we have the ONNX model and the code to run it with ONNX Runtime, we can deploy it using Azure ML.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Check your environment"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1649374558896
        }
      },
      "outputs": [],
      "source": [
        "# Check core SDK version number\n",
        "import azureml.core\n",
        "import onnxruntime\n",
        "import torch\n",
        "import transformers\n",
        "\n",
        "print(\"Transformers version: \", transformers.__version__)\n",
        "torch_version = torch.__version__\n",
        "print(\"Torch (ONNX exporter) version: \", torch_version)\n",
        "print(\"Azure SDK version:\", azureml.core.VERSION)\n",
        "print(\"ONNX Runtime version: \", onnxruntime.__version__)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Load your Azure ML workspace\n",
        "\n",
        "We begin by instantiating a workspace object from the existing workspace created earlier in the configuration notebook.\n",
        "\n",
        "Note that the following code assumes you have a `config.json` file containing the subscription information in the same directory as the notebook, or in a sub-directory called `.azureml`. You can also supply the workspace name, subscription name, and resource group explicity using the `Workspace.get()` method."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "from azureml.core import Workspace\n",
        "\n",
        "ws = Workspace.from_config()\n",
        "print(ws.name, ws.location, ws.resource_group, ws.subscription_id, sep = '\\n')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Register your model with Azure ML\n",
        "\n",
        "Now we upload the model and register it in the workspace.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1649374588035
        }
      },
      "outputs": [],
      "source": [
        "from azureml.core.model import Model\n",
        "\n",
        "model = Model.register(model_path = model_path,                 # Name of the registered model in your workspace.\n",
        "                       model_name = model_name,            # Local ONNX model to upload and register as a model\n",
        "                       model_framework=Model.Framework.ONNX ,   # Framework used to create the model.\n",
        "                       model_framework_version=torch_version,   # Version of ONNX used to create the model.\n",
        "                       tags = {\"onnx\": \"demo\"},\n",
        "                       description = \"HuggingFace Bert model fine-tuned with SQuAd and exported from PyTorch\",\n",
        "                       workspace = ws)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Display your registered models\n",
        "\n",
        "You can list out all the models that you have registered in this workspace."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1649374588649
        }
      },
      "outputs": [],
      "source": [
        "models = ws.models\n",
        "for name, m in models.items():\n",
        "    print(\"Name:\", name,\"\\tVersion:\", m.version, \"\\tDescription:\", m.description, m.tags)\n",
        "    \n",
        "#     # If you'd like to delete the models from workspace\n",
        "#     model_to_delete = Model(ws, name)\n",
        "#     model_to_delete.delete()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Deploy the model \n",
        "\n",
        "We are now going to deploy our ONNX model on Azure ML using ONNX Runtime.\n",
        "\n",
        "Firstly we will test the deployment using an Azure Container Instance, then deploy the model for production using an Azure ML endpoint.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Deploy Model and Scoring code as an Azure ML endpoint\n",
        "\n",
        "Note that the endpoint interface of the Python SDK has not been publicly released yet, so for this section, we will use the Azure ML CLI.\n",
        "\n",
        "There are three YML files in the `yml` folder:\n",
        "* `env.yml`: A conda environment specification, from which the execution environment of the endpoint will be generated\n",
        "* `endpoint.yml`: The endpoint specification, which simply contains the name of the endpoint and the authorization method\n",
        "* `deployment.yml`: The deployment specification, which contains specifications of the scoring code, model, and environment. You can create multiple deployments per endpoint, and route different amounts of traffic to the deployments. For this example, we will create only one deployment.\n",
        "\n",
        "The deployment can take up to 15 minutes. Note also that all of the files in the directory with the notebook will be uploaded into the docker container that forms the basis of your endpoint, including any local copies of the ONNX model (which has already been deployed to AzureML in the previous step). To reduce the deployment time remove any local copies of large files, before creating the endpoint."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "!az ml online-endpoint create --name question-answer-ort --file yml/endpoint.yml --subscription {ws.subscription_id} --resource-group {ws.resource_group} --workspace-name {ws.name} \n",
        "!az ml online-deployment create --endpoint-name question-answer-ort --name blue --file yml/deployment.yml --all-traffic --subscription {ws.subscription_id} --resource-group {ws.resource_group} --workspace-name {ws.name} "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Test the deployed endpoint\n",
        "\n",
        "The following cell runs the deployed question answer model. There is a test question in the `test-data.json` file. You can edit this file with your own question and context."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "!az ml online-endpoint invoke --name question-answer-ort --request-file test-data.json --subscription {ws.subscription_id} --resource-group {ws.resource_group} --workspace-name {ws.name} "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Success!\n",
        "\n",
        "If you've made it this far, you've deployed a working endpoint that answers a question using an ONNX model."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Clean up Azure resources\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "!az ml online-endpoint delete --name question-answer-ort --yes --subscription {ws.subscription_id} --resource-group {ws.resource_group} --workspace-name {ws.name} "
      ]
    }
  ],
  "metadata": {
    "interpreter": {
      "hash": "f0528bd5f6ac7ac52612a9005b50b1458e2b7d1852103f3c7d5ad6e0d876b262"
    },
    "kernel_info": {
      "name": "gpu-kernel"
    },
    "kernelspec": {
      "display_name": "gpu-kernel",
      "language": "python",
      "name": "gpu-kernel"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.7"
    },
    "microsoft": {
      "host": {
        "AzureML": {
          "notebookHasBeenCompleted": true
        }
      }
    },
    "nteract": {
      "version": "nteract-front-end@1.0.0"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
